The data here is taken from Data Hackathon 3.x: http://datahack.analyticsvidhya.com/contest/data-hackathon-3x
In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
from sklearn.model_selection import cross_val_score, GridSearchCV
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4
The data has already been pre-processed; the modified file is loaded directly below.
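The exact steps are not listed in this notebook. Purely as an illustration (the file and column handling below are hypothetical, not the actual steps behind train_modified.csv), pre-processing of this kind usually amounts to something like:
In [ ]:
# Illustrative sketch only -- not the actual pipeline used for train_modified.csv.
# raw = pd.read_csv('train.csv')                        # hypothetical raw file
# raw = raw.fillna(raw.median(numeric_only=True))       # impute numeric missing values
# raw = pd.get_dummies(raw, drop_first=True)            # one-hot encode categoricals
# raw.to_csv('train_modified.csv', index=False)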
In [2]:
train = pd.read_csv('train_modified.csv')
In [3]:
target='Disbursed'
IDcol = 'ID'
In [6]:
train['Disbursed'].value_counts()
Out[6]:
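If the counts show a heavy class imbalance (common for loan-disbursal data), that justifies scoring with 'roc_auc' rather than accuracy in all the searches below. A quick normalized view makes the ratio explicit:
In [ ]:
# Proportion of each class; a very small positive fraction would make
# raw accuracy misleading, hence 'roc_auc' as the CV scoring metric.
train[target].value_counts(normalize=True)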
In [7]:
def modelfit(alg, dtrain, predictors, performCV=True, printFeatureImportance=True, cv_folds=5):
    # Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain['Disbursed'])

    # Predict on the training set
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:, 1]

    # Perform cross-validation
    if performCV:
        cv_score = cross_val_score(alg, dtrain[predictors], dtrain['Disbursed'], cv=cv_folds, scoring='roc_auc')

    # Print model report
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(dtrain['Disbursed'].values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['Disbursed'], dtrain_predprob))
    if performCV:
        print("CV Score : Mean - %.7g | Std - %.7g | Min - %.7g | Max - %.7g" %
              (np.mean(cv_score), np.std(cv_score), np.min(cv_score), np.max(cv_score)))

    # Plot feature importances
    if printFeatureImportance:
        feat_imp = pd.Series(alg.feature_importances_, predictors).sort_values(ascending=False)
        feat_imp.plot(kind='bar', title='Feature Importances')
        plt.ylabel('Feature Importance Score')
In [9]:
#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
gbm0 = GradientBoostingClassifier(random_state=10)
modelfit(gbm0, train, predictors)
We will use the following benchmark values for the other parameters: min_samples_split=500, min_samples_leaf=50, max_depth=8, max_features='sqrt', subsample=0.8.
A learning rate of 0.1 is assumed to be a good starting point. Let's try to find the optimum number of estimators required for it.
In [10]:
#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
param_test1 = {'n_estimators':range(20,81,10)}
gsearch1 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, min_samples_split=500,
min_samples_leaf=50,max_depth=8,max_features='sqrt', subsample=0.8,random_state=10),
param_grid=param_test1, scoring='roc_auc', n_jobs=4, cv=5)
gsearch1.fit(train[predictors],train[target])
Out[10]:
In [11]:
gsearch1.cv_results_['mean_test_score'], gsearch1.best_params_, gsearch1.best_score_
Out[11]:
So we got 60 as the optimal number of estimators for the 0.1 learning rate. Note that 60 is a reasonable value and can be used as is, but it might not be the same in all cases. Other situations:
- If the optimum had come out near the bottom of the tested range (around 20), it would be worth lowering the learning rate and re-running the search, so that more trees are required.
- If it had hit the top of the range (80), tuning the remaining parameters would become slow; a higher learning rate, or a wider estimator range, should be tried instead (see the sketch below).
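For instance, if the best value had landed on the boundary of the grid, a re-search with a halved learning rate and a wider estimator range might look like this (illustrative values only, not run on this data):
In [ ]:
# Hypothetical re-search: halve the learning rate and widen the estimator range.
param_retry = {'n_estimators': range(40, 161, 20)}
gsearch_retry = GridSearchCV(
    estimator=GradientBoostingClassifier(learning_rate=0.05, min_samples_split=500,
                                         min_samples_leaf=50, max_depth=8,
                                         max_features='sqrt', subsample=0.8, random_state=10),
    param_grid=param_retry, scoring='roc_auc', n_jobs=4, cv=5)
# gsearch_retry.fit(train[predictors], train[target])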
Now, let's move on to tuning the tree-specific parameters. We will do this in two stages:
1. Tune max_depth and min_samples_split.
2. Tune min_samples_leaf (along with higher min_samples_split values).
In [13]:
#Grid search on max_depth and min_samples_split
param_test2 = {'max_depth':range(5,16,2), 'min_samples_split':range(200,1001,200)}
gsearch2 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,
max_features='sqrt', subsample=0.8, random_state=10),
param_grid=param_test2, scoring='roc_auc', n_jobs=4, cv=5)
gsearch2.fit(train[predictors],train[target])
Out[13]:
In [14]:
gsearch2.cv_results_['mean_test_score'], gsearch2.best_params_, gsearch2.best_score_
Out[14]:
Since the best min_samples_split reached the maximum of the range we tested (1000), we should check higher values as well. We can also tune min_samples_leaf along with it now that max_depth is fixed. One might argue that max_depth could change for higher min_samples_split values, but if you observe the output closely, max_depth = 9 gave the better model in most cases. So let's perform a grid search over both:
In [17]:
#Grid search on min_samples_split and min_samples_leaf
param_test3 = {'min_samples_split':range(1000,2100,200), 'min_samples_leaf':range(30,71,10)}
gsearch3 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,max_depth=9,
max_features='sqrt', subsample=0.8, random_state=10),
param_grid=param_test3, scoring='roc_auc', n_jobs=4, cv=5)
gsearch3.fit(train[predictors],train[target])
Out[17]:
In [18]:
gsearch3.cv_results_['mean_test_score'], gsearch3.best_params_, gsearch3.best_score_
Out[18]:
In [49]:
modelfit(gsearch3.best_estimator_, train, predictors)
Tune max_features:
In [20]:
#Grid search on max_features
param_test4 = {'max_features':range(7,20,2)}
gsearch4 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,max_depth=9,
min_samples_split=1200, min_samples_leaf=60, subsample=0.8, random_state=10),
param_grid=param_test4, scoring='roc_auc', n_jobs=4, cv=5)
gsearch4.fit(train[predictors],train[target])
Out[20]:
In [22]:
gsearch4.cv_results_['mean_test_score'], gsearch4.best_params_, gsearch4.best_score_
Out[22]:
In [23]:
#Grid search on subsample
param_test5 = {'subsample':[0.6,0.7,0.75,0.8,0.85,0.9]}
gsearch5 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,max_depth=9,
min_samples_split=1200, min_samples_leaf=60, subsample=0.8, random_state=10, max_features=7),
param_grid=param_test5, scoring='roc_auc', n_jobs=4, cv=5)
gsearch5.fit(train[predictors],train[target])
Out[23]:
In [24]:
gsearch5.cv_results_['mean_test_score'], gsearch5.best_params_, gsearch5.best_score_
Out[24]:
With all the other parameters tuned, let's try reducing the learning rate and proportionally increasing the number of estimators to get more robust results. First, half the learning rate (0.05) with twice the estimators (120):
In [26]:
#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
gbm_tuned_1 = GradientBoostingClassifier(learning_rate=0.05, n_estimators=120,max_depth=9, min_samples_split=1200,
min_samples_leaf=60, subsample=0.85, random_state=10, max_features=7)
modelfit(gbm_tuned_1, train, predictors)
1/10th learning rate (0.01), with 10x the estimators (600):
In [29]:
#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
gbm_tuned_2 = GradientBoostingClassifier(learning_rate=0.01, n_estimators=600,max_depth=9, min_samples_split=1200,
min_samples_leaf=60, subsample=0.85, random_state=10, max_features=7)
modelfit(gbm_tuned_2, train, predictors)
1/20th learning rate (0.005), with 20x the estimators (1200):
In [43]:
#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
gbm_tuned_3 = GradientBoostingClassifier(learning_rate=0.005, n_estimators=1200,max_depth=9, min_samples_split=1200,
min_samples_leaf=60, subsample=0.85, random_state=10, max_features=7,
warm_start=True)
modelfit(gbm_tuned_3, train, predictors, performCV=False)
Finally, the same 0.005 learning rate with an even higher number of trees (1500):
In [46]:
#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
gbm_tuned_4 = GradientBoostingClassifier(learning_rate=0.005, n_estimators=1500,max_depth=9, min_samples_split=1200,
min_samples_leaf=60, subsample=0.85, random_state=10, max_features=7,
warm_start=True)
modelfit(gbm_tuned_4, train, predictors, performCV=False)
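A note on warm_start=True in the last two cells: gbm_tuned_3 and gbm_tuned_4 are independent objects, so each fit() call above still trains from scratch. To actually reuse previously fitted trees, increase n_estimators on the same object and refit; a minimal sketch of scikit-learn's documented warm_start behavior:
In [ ]:
# Sketch of incremental fitting with warm_start: the second fit() call adds
# 300 new trees on top of the existing 1200 instead of retraining them.
gbm_ws = GradientBoostingClassifier(learning_rate=0.005, n_estimators=1200, max_depth=9,
                                    min_samples_split=1200, min_samples_leaf=60,
                                    subsample=0.85, random_state=10, max_features=7,
                                    warm_start=True)
gbm_ws.fit(train[predictors], train[target])
gbm_ws.set_params(n_estimators=1500)
gbm_ws.fit(train[predictors], train[target])   # continues from the 1200-tree model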